Prefilter

2021-04-20

Introduction

Large and heterogeneous datasets may contain thousands of records missing spatial or taxonomic information (partially or entirely), as well as many records outside a region of interest or from doubtful sources. Such lower-quality data are not fit for use in many research applications without prior amendments. The ‘Pre-filter’ step contains a series of tests to detect, remove, and, whenever possible, correct such erroneous or suspect records.


Important:

The results of the VALIDATION tests used to flag data quality are appended to the database as separate fields and returned as TRUE or FALSE: TRUE indicates a correct record, and FALSE a potentially problematic or suspect record.

Installation

You can install the released version of ‘bdc’ from GitHub with:

if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results

bdc::bdc_create_dir()

Read the database

Read the merged database created in the “Standardization and integration of different datasets” step of the BDC workflow. Any dataset containing the fields required by the workflow can be read instead (more details here).

database <-
  qs::qread("Output/Intermediate/00_merged_database.qs")

Standardization of character encoding

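# Mark every character column as UTF-8
for (i in seq_len(ncol(database))) {
  # `[[` extracts the column as a vector, which also works for tibbles
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}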
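Note that `Encoding<-` only declares how the bytes of a string should be interpreted; it does not convert them. The sketch below uses base R only and an invented string to show the difference. It assumes a column that is actually stored as Latin-1, which may or may not apply to your data.

```r
# "café" stored as Latin-1 bytes (the last byte, 0xe9, is "é" in Latin-1).
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# The bytes are not valid UTF-8, so simply marking them as UTF-8 would not help.
valid_before <- validUTF8(x)

# Converting from the real source encoding produces valid UTF-8 text.
y <- iconv(x, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(y)

c(before = valid_before, after = valid_after)
```

If `validUTF8()` returns FALSE for some values of a column, converting with `iconv()` from the known source encoding is one possible remedy.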



1 - Records missing species names

VALIDATION. This test flags records missing species names.

check_pf <- bdc_scientificName_empty(
  data = database,
  sci_name = "scientificName")
#> 
#> bdc_scientificName_empty:
#> Flagged 324 records.
#> One column was added to the database.
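To make the flag convention concrete, here is a minimal base R sketch of this kind of test on invented data. It is not the implementation of `bdc_scientificName_empty`; treating whitespace-only names as missing is an assumption of the sketch.

```r
# Toy data: the column name follows the vignette, the values are invented.
toy <- data.frame(
  scientificName = c("Panthera onca", "", NA, "   ", "Puma concolor"),
  stringsAsFactors = FALSE
)

# A name counts as missing when it is NA, empty, or whitespace only (assumption).
name_missing <- is.na(toy$scientificName) | trimws(toy$scientificName) == ""

# Flag convention of the workflow: TRUE = record passes, FALSE = record flagged.
toy$.scientificName_empty <- !name_missing

n_flagged <- sum(!toy$.scientificName_empty)
n_flagged
```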

2 - Records lacking information on geographic coordinates

VALIDATION. This test flags records missing partial or complete information on geographic coordinates.

check_pf <- bdc_coordinates_empty(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> 
#> bdc_coordinates_empty:
#> Flagged 1921 records.
#> One column was added to the database.
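The idea can be sketched in base R as follows. This is an illustration on invented data, not the code of `bdc_coordinates_empty`; in particular, treating non-numeric text as a missing coordinate is an assumption of the sketch.

```r
# Toy data with coordinates stored as text, as often happens after merging sources.
toy <- data.frame(
  decimalLatitude  = c("-15.8", "", NA, "abc"),
  decimalLongitude = c("-47.9", "-60.0", "-45.2", "-50"),
  stringsAsFactors = FALSE
)

# Empty strings and non-numeric text become NA when coerced to numeric.
lat <- suppressWarnings(as.numeric(toy$decimalLatitude))
lon <- suppressWarnings(as.numeric(toy$decimalLongitude))

# TRUE = both coordinates present, FALSE = at least one is missing.
toy$.coordinates_empty <- !(is.na(lat) | is.na(lon))
toy
```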

3 - Records with out-of-range coordinates

VALIDATION. This test flags records with out-of-range coordinates, that is, latitude > 90 or < -90, or longitude > 180 or < -180.

check_pf <- bdc_coordinates_outOfRange(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> 
#> bdc_coordinates_outOfRange:
#> Flagged 23 records.
#> One column was added to the database.
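The range check itself is simple arithmetic. A minimal base R sketch on invented coordinates (not the package's own code) could look like this:

```r
# Toy coordinates: only the first record lies within the valid ranges.
toy <- data.frame(
  decimalLatitude  = c(-15.8, 95.0, -91.2, 10.0),
  decimalLongitude = c(-47.9, -60.0, 20.0, 181.5)
)

# Valid latitude lies in [-90, 90] and valid longitude in [-180, 180].
lat_ok <- abs(toy$decimalLatitude) <= 90
lon_ok <- abs(toy$decimalLongitude) <= 180

# TRUE = both coordinates in range, FALSE = at least one out of range.
toy$.coordinates_outOfRange <- lat_ok & lon_ok
toy
```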

4 - Records from doubtful sources

VALIDATION. This test flags records from doubtful sources, for example, records derived from drawings, photographs, or multimedia objects, and fossil records, among others.

check_pf <- bdc_basisOfRrecords_notStandard(
  data = check_pf,
  basisOfRecord = "basisOfRecord",
  names_to_keep = "all")
#> 
#> bdc_basisOfRrecords_notStandard:
#> Flagged 5 of the following specific nature:
#>  c("FOSSIL_SPECIMEN", "Extra", "Liqui") 
#> One column was added to the database.
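Conceptually, this test compares each record's `basisOfRecord` against a list of accepted values. The base R sketch below illustrates that comparison; the `accepted` vector is a hypothetical list chosen for the example and is not the list used by the package when `names_to_keep = "all"`.

```r
# Hypothetical list of accepted values (an assumption of this sketch).
accepted <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION",
              "MACHINE_OBSERVATION", "LIVING_SPECIMEN", "MATERIAL_SAMPLE")

toy <- data.frame(
  basisOfRecord = c("PRESERVED_SPECIMEN", "human_observation",
                    "FOSSIL_SPECIMEN", "Extra"),
  stringsAsFactors = FALSE
)

# Compare case-insensitively; TRUE = accepted source, FALSE = flagged.
toy$.basis_ok <- toupper(toy$basisOfRecord) %in% accepted
toy
```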

5 - Getting country names from valid coordinates

ENRICHMENT. Country names are derived from valid coordinates for records missing country names.

check_pf <- bdc_country_from_coordinates(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country")
#> 
#> bdc_country_from_coordinates:
#> Country names were added to 1123 records.

6 - Standardizing country names and getting country code information

ENRICHMENT. Country names are standardized against a list of country names in several languages retrieved from Wikipedia.

check_pf <- bdc_country_standardized(
  data = check_pf,
  country = "country"
)
#> Loading auxiliary data: country names from wikipedia
#> Loading auxiliary data: world map and country iso
#> Standardizing country names
#> country found: Argentina
#> country found: Belize
#> country found: Bolivia
#> country found: Brazil
#> country found: Colombia
#> country found: Ecuador
#> country found: France
#> country found: French Guiana
#> country found: Guyana
#> country found: Honduras
#> country found: Japan
#> country found: Mexico
#> country found: Nicaragua
#> country found: Paraguay
#> country found: Suriname
#> country found: Uruguay
#> country found: Venezuela
#> 
#> bdc_country_standardized:
#> The country names of 8540 records were standardized.
#> Two columns were added to the database.

7 - Correcting transposed latitude and longitude

AMENDMENT. A mismatch between the informed country and the coordinates can result from negative or transposed coordinates. Once a mismatch is detected, different coordinate transformations are applied to try to resolve it. Verbatim coordinates are then replaced by the rectified ones in the returned database (a database containing both verbatim and corrected coordinates is also created in the “Output” folder). Records near a country's coastline are not tested, to avoid false positives.

check_pf <-
  bdc_coordinates_transposed(
    data = check_pf,
    id = "database_id",
    sci_names = "scientificName",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country",
    countryCode = "countryCode", 
    border_buffer = 0.2 # in decimal degrees (~22 km at the equator)
  )
#> Correcting latitude and longitude transposed
#> Testing coordinate validity
#> Removed 1522 records.
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing sea coordinates
#> Flagged 704 records.
#> Testing country identity
#> Flagged 716 records.
#> Flagged 716 of 7018 records, EQ = 0.1.
#> 716 ocurrences will be tested
#> Processing occurrences from: BR (713)
#> Processing occurrences from: CO (1)
#> Processing occurrences from: MX (1)
#> Processing occurrences from: VE (1)
#> 
#> bdc_coordinates_transposed:
#> Corrected 19 records.
#> One columns were added to the database.
#> Check database containing coordinates corrected in:
#> Output/Check/01_coordinates_transposed.csv
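The "different coordinate transformations" can be pictured as every combination of swapping and sign-flipping a coordinate pair, keeping the candidate that falls inside the informed country. The base R sketch below illustrates this idea only. The helper names `candidate_coords` and `in_box` are hypothetical, and the rough bounding box for Brazil is an approximation standing in for the country polygons a real test would use.

```r
# All swap/sign combinations of one coordinate pair (the original pair excluded).
candidate_coords <- function(lat, lon) {
  cand <- data.frame(
    transformation = c("negate_lat", "negate_lon", "negate_both", "swap",
                       "swap_negate_lat", "swap_negate_lon", "swap_negate_both"),
    lat = c(-lat,  lat, -lat, lon, -lon,  lon, -lon),
    lon = c( lon, -lon, -lon, lat,  lat, -lat, -lat),
    stringsAsFactors = FALSE
  )
  # Drop candidates that are not valid coordinates.
  cand[abs(cand$lat) <= 90 & abs(cand$lon) <= 180, ]
}

# Crude stand-in for a point-in-country test: an approximate bounding box.
in_box <- function(lat, lon, box) {
  lat >= box$lat_min & lat <= box$lat_max &
    lon >= box$lon_min & lon <= box$lon_max
}
brazil_box <- list(lat_min = -34, lat_max = 5.5, lon_min = -74, lon_max = -34)

# A record from central Brazil whose latitude and longitude were swapped.
cand <- candidate_coords(lat = -47.9, lon = -15.8)
fix <- cand[in_box(cand$lat, cand$lon, brazil_box), ]
fix
```

Only the plain swap places this record inside the box, so that candidate would be the corrected coordinate pair.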

8 - Records outside a region of interest

VALIDATION. This test flags records outside one or multiple reference countries, i.e., records in other countries or farther than an informed distance from the coast (e.g., in the ocean). The distance tolerance avoids flagging as invalid records that are close to country limits (e.g., records of coastal or marshland species).

check_pf <-
  bdc_coordinates_country_inconsistent(
    data = check_pf,
    country_name = "Brazil",
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    dist = 0.1 # in decimal degrees (~11 km at the equator)
  )
#> dist is assumed to be in decimal degrees (arc_degrees).
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> 
#> bdc_coordinates_country_inconsistent:
#> Flagged 658 records.
#> One column was added to the database.
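Because `dist` is given in decimal degrees, its size on the ground depends on latitude. The sketch below converts degrees to kilometres using the common approximation of about 111.32 km per degree at the equator; `deg_to_km` is a hypothetical helper written for this example, and a spherical Earth is assumed.

```r
# Approximate length of one degree of arc at the equator, in km.
km_per_degree <- 111.32

# North-south distance is roughly constant; east-west distance shrinks
# with the cosine of the latitude (spherical approximation).
deg_to_km <- function(deg, lat = 0) {
  c(north_south = deg * km_per_degree,
    east_west   = deg * km_per_degree * cos(lat * pi / 180))
}

at_equator <- deg_to_km(0.1)
at_30_south <- deg_to_km(0.1, lat = -30)
rbind(at_equator, at_30_south)
```

A tolerance of 0.1 degrees is therefore about 11 km at the equator, but only about 9.6 km in the east-west direction at 30 degrees of latitude.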

9 - Save records not geo-referenced but with locality information

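ENRICHMENT. Coordinates can be derived from a detailed description of the locality associated with a record, in a process called retrospective geo-referencing. This step saves the records that lack valid coordinates but contain potentially useful locality information, so that they can be geo-referenced later.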

xyFromLocality <- bdc_coordinates_from_locality(
  data = check_pf,
  locality = "locality",
  lon = "decimalLongitude",
  lat = "decimalLatitude"
)
#> 
#> bdc_coordinates_from_locality 
#> Found 1524 records missing or with invalid coordinates but with potentially useful information on locality.
#>  
#> Check database in: C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Check/01_coordinates_from_locality.csv

Report

Create a column named “.summary” that summarizes the results of all VALIDATION tests. This column is FALSE when a record was flagged as FALSE by any test (i.e., a potentially invalid or suspect record).

check_pf <- bdc_summary_col(data = check_pf)
#> 
#> bdc_summary_col:
#> Flagged 2888 records.
#> One column was added to the database.
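The logic of the summary column amounts to a row-wise logical AND over all flag columns. Here is a minimal base R sketch on invented flags; it is not the code of `bdc_summary_col`, and it assumes that every column whose name starts with “.” is a test flag.

```r
# Toy database with two flag columns (TRUE = passed, FALSE = flagged).
toy <- data.frame(
  database_id = c("a1", "a2", "a3"),
  .scientificName_empty = c(TRUE, TRUE, FALSE),
  .coordinates_empty    = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Flag columns are those whose names start with a dot (assumption).
flag_cols <- grep("^\\.", names(toy), value = TRUE)

# A record passes only if it passes every individual test.
toy$.summary <- Reduce(`&`, as.list(toy[flag_cols]))
toy
```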



Creating a report summarizing the results of all tests.

report <-
  bdc_create_report(data = check_pf,
                    database_id = "database_id",
                    workflow_step = "prefilter")
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report

report


Figures

Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests. See some examples below.

bdc_create_figures(data = check_pf,
                   database_id = "database_id",
                   workflow_step = "prefilter")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures


Figure: Transposed coordinates

Figure: Coordinates and country inconsistent

Figure: Summary of all tests


Filter the database

We can remove records flagged as erroneous or suspect. Records missing names or coordinates, outside a region of interest, or from doubtful sources are rarely suitable for biodiversity analyses. We keep only valid records (flagged as TRUE) using the column “.summary”. Next, we use the bdc_filter_out_flags function to remove all test columns (those starting with “.”).

library(magrittr) # provides the %>% pipe used below

output <-
  check_pf %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#> 
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .scientificName_empty, .coordinates_empty, .coordinates_outOfRange, .basisOfRrecords_notStandard, .coordinates_country_inconsistent, .summary
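For readers who prefer to see the filtering step without the pipe, the same two operations (keep valid rows, then drop the flag columns) can be sketched in base R on invented data. This illustrates the idea and is not a replacement for `bdc_filter_out_flags`.

```r
# Toy database after the summary step.
toy <- data.frame(
  database_id = c("a1", "a2", "a3"),
  .coordinates_empty = c(TRUE, FALSE, TRUE),
  .summary = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep rows whose summary is TRUE (%in% also treats NA as "not TRUE"),
# then drop every column whose name starts with a dot.
keep_rows <- toy$.summary %in% TRUE
keep_cols <- !startsWith(names(toy), ".")
kept <- toy[keep_rows, keep_cols, drop = FALSE]
kept
```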

Save the database

output %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "01_prefilter_database.qs"))